Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer
For researchers and practitioners in text style transfer, this work contributes a new benchmark and an evaluation insight, though the modeling is incremental: it applies existing models to a novel task.
The authors built a parallel corpus of 25,140 AI/human text pairs and fine-tuned BART and Mistral models for AI-to-human style transfer. BART-large achieved the best reference similarity (BERTScore F1 0.924, ROUGE-L 0.566, chrF++ 55.92) with 17x fewer parameters than Mistral-7B, while Mistral's higher marker shift score reflected overshoot rather than accuracy.
AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity -- BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 -- with 17x fewer parameters than Mistral-7B. We show that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.
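The difference between a high marker shift and an accurate one can be made concrete with a minimal sketch. This is not the authors' implementation: the abstract does not define either quantity, so the function names, both formulas, and the toy numbers below are assumptions chosen only for illustration. Here a marker is any scalar stylistic measurement (for example, a rate per 1,000 words), "shift" is the fraction of the AI-to-human gap that the output covers, and "accuracy" penalizes the distance that remains to the human reference value in either direction.

```python
def marker_shift(ai_value: float, output_value: float, human_value: float) -> float:
    """Hypothetical shift score: fraction of the AI-to-human gap covered.

    Values above 1.0 mean the output moved past the human reference (overshoot).
    """
    gap = human_value - ai_value
    if gap == 0:
        return 0.0  # no gap to close, so no shift is defined; report zero
    return (output_value - ai_value) / gap


def shift_accuracy(ai_value: float, output_value: float, human_value: float) -> float:
    """Hypothetical accuracy score: 1.0 when the output lands on the human value.

    Undershoot and overshoot are penalized equally; the score is floored at 0.0.
    """
    gap = abs(human_value - ai_value)
    if gap == 0:
        return 1.0 if output_value == human_value else 0.0
    return max(0.0, 1.0 - abs(output_value - human_value) / gap)


# Toy values, invented for illustration (not taken from the paper).
ai_rate, human_rate = 10.0, 30.0   # marker rate in AI input vs. human reference
close_output = 28.0                # stops just short of the human rate
overshoot_output = 40.0            # moves well past the human rate

for name, out in [("close", close_output), ("overshoot", overshoot_output)]:
    print(name,
          round(marker_shift(ai_rate, out, human_rate), 3),
          round(shift_accuracy(ai_rate, out, human_rate), 3))
```

Under these assumed definitions, the overshooting output earns the larger shift score while landing farther from the human reference, which is the pattern the abstract attributes to Mistral-7B. A metric that rewards only the magnitude of movement cannot tell the two cases apart.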