CL, AI · Aug 29, 2025

Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards

Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin

arXiv:2508.21476v1 · EMNLP · Has Code · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the challenge of making creative writing more accessible and scalable for SLMs. It offers a more efficient alternative to costly methods such as RLHF, though it is an incremental improvement on existing RLAIF techniques.

This paper tackles the problem of enhancing creative writing in Small Language Models (SLMs) by comparing two AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework, using the generation of Chinese greetings as the target task. It finds that a principle-guided LLM-as-a-Judge yields higher generation quality than a reward model trained on multi-agent-curated preference data, while also training more efficiently and depending less on human-annotated data.
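The summary describes an LLM-as-a-Judge that is guided by principles and scores generations directly as the RL reward. As a minimal sketch of what such a reward function might look like, the code below assumes the judge rates each response on a 1 to 10 scale per principle and that the scores are combined as a weighted mean rescaled to [0, 1]. The principle names, weights, score range, and the `toy_judge` stand-in are all hypothetical; the paper's actual rubric, prompt format, and its adversarially trained, reflection-based judge are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A judge maps (prompt, response, principle_name) to an integer score.
# In practice this would wrap an LLM call; here it is any Python callable.
JudgeFn = Callable[[str, str, str], int]


@dataclass(frozen=True)
class Principle:
    name: str
    weight: float


# Hypothetical rubric for greeting generation; NOT the paper's actual principles.
DEFAULT_PRINCIPLES: Sequence[Principle] = (
    Principle("relevance_to_occasion", 0.4),
    Principle("creativity", 0.4),
    Principle("fluency", 0.2),
)


def principle_guided_reward(
    prompt: str,
    response: str,
    judge: JudgeFn,
    principles: Sequence[Principle] = DEFAULT_PRINCIPLES,
    min_score: int = 1,
    max_score: int = 10,
) -> float:
    """Weighted mean of per-principle judge scores, rescaled to [0, 1]."""
    total_weight = sum(p.weight for p in principles)
    reward = 0.0
    for p in principles:
        raw = judge(prompt, response, p.name)
        # Clamp out-of-range judge outputs before normalizing.
        score = max(min_score, min(max_score, raw))
        reward += p.weight * (score - min_score) / (max_score - min_score)
    return reward / total_weight


def toy_judge(prompt: str, response: str, principle: str) -> int:
    # Hypothetical stand-in for an LLM judge so the sketch runs offline.
    if principle == "fluency":
        return 8
    return 10 if len(response) > 10 else 4


long_reply = "Wishing you a bright and joyful new year!"
print(round(principle_guided_reward("New Year greeting", long_reply, toy_judge), 4))  # 0.9556
print(round(principle_guided_reward("New Year greeting", "Hi!", toy_judge), 4))       # 0.4222
```

Normalizing to a fixed range keeps the reward scale stable regardless of how many principles are used, which is generally convenient for policy-gradient RL; whether the paper normalizes this way is an assumption.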

Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing ability of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy uses a principle-guided LLM-as-a-Judge to provide reward signals directly; its reward function is optimized via an adversarial training scheme with a reflection mechanism. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge yields demonstrably superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path toward creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
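As a rough illustration of the first strategy (an RM trained on preference data curated by multi-agent rejection sampling), the sketch below assumes that several reviewer agents vote on each sampled candidate, that candidates reaching a vote threshold become "chosen" and the rest "rejected", and that the RM is then trained with the standard Bradley-Terry pairwise loss commonly used in reward modeling. The agents, the `min_votes` threshold, and every function name here are hypothetical; the paper's actual multi-agent framework and RM objective may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical reviewer agent: (prompt, candidate) -> accept?
Agent = Callable[[str, str], bool]
PreferencePair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def multi_agent_filter(
    prompt: str, candidates: Sequence[str], agents: Sequence[Agent], min_votes: int
) -> Tuple[List[str], List[str]]:
    """Split candidates by how many agents approve each one."""
    accepted: List[str] = []
    rejected: List[str] = []
    for cand in candidates:
        votes = sum(1 for agent in agents if agent(prompt, cand))
        (accepted if votes >= min_votes else rejected).append(cand)
    return accepted, rejected


def build_preference_pairs(
    prompt: str, candidates: Sequence[str], agents: Sequence[Agent], min_votes: int
) -> List[PreferencePair]:
    accepted, rejected = multi_agent_filter(prompt, candidates, agents, min_votes)
    # Every accepted response is treated as preferred over every rejected one.
    return [(prompt, a, r) for a in accepted for r in rejected]


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise RM loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == softplus(-m); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy rule-based agents standing in for LLM reviewers (hypothetical).
agents: List[Agent] = [
    lambda p, c: len(c) >= 10,                           # not too short
    lambda p, c: "new year" in c.lower(),                # on topic
    lambda p, c: len(set(c.split())) == len(c.split()),  # no repeated words
]
candidates = [
    "Happy new year, may all your wishes come true",
    "new year new year",
    "Hi",
]
pairs = build_preference_pairs("New Year greeting", candidates, agents, min_votes=3)
print(len(pairs))                              # 2
print(round(bradley_terry_loss(0.0, 0.0), 4))  # 0.6931 (ln 2)
```

Requiring unanimous approval (`min_votes=3` with three agents) trades recall for precision in the "chosen" set; a lower threshold would yield more, but noisier, preference pairs.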

