CL · Jul 21, 2025

Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

arXiv:2507.15557v1 · 2.7 · h-index: 16

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of unreliable evaluation in text generation for researchers and practitioners, offering a multilingual benchmark that incrementally extends prior English-focused efforts.

The paper tackles the challenge of evaluating text style transfer, specifically text detoxification, by conducting the first comprehensive multilingual study across nine languages, assessing neural-based models and LLM-as-a-judge approaches to provide a practical pipeline for more reliable evaluation.

Despite recent progress in large language models (LLMs), evaluation of text generation tasks such as text style transfer (TST) remains a significant challenge. Recent studies (Dementieva et al., 2024; Pauli et al., 2025) revealed a substantial gap between automatic metrics and human judgments. Moreover, most prior work focuses exclusively on English, leaving multilingual TST evaluation largely unexplored. In this paper, we perform the first comprehensive multilingual study on the evaluation of text detoxification systems across nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. Drawing inspiration from machine translation, we assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches. Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipelines in the text detoxification case.
