CL · Feb 7, 2025

Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?

arXiv:2502.04718v2 · 11 citations · h-index: 30 · NAACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reliable automatic evaluation for text style transfer, a problem that matters to NLP researchers and practitioners. The contribution is incremental, since it builds on existing metrics and methods.

The paper examines existing and novel metrics, including large language models, for evaluating text style transfer on sentiment transfer and detoxification tasks in multiple languages. It finds that advanced NLP metrics and LLM-based evaluations offer better insights than current TST metrics, with oracle ensembles showing further potential.

Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks. However, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine a set of existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics. Our oracle ensemble approaches show even more potential.
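
To make "meta-evaluation through correlation with human judgments" concrete, here is a minimal sketch of how such a check might look. It is not the paper's code. The scores are invented toy data, the names `metric_a` and `metric_b` are hypothetical, and Spearman rank correlation is an assumption, since the abstract does not say which coefficient was used. The rank-averaging ensemble is only a simple stand-in and is not the oracle ensemble the authors describe.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: one human rating and two automatic scores per output.
human = [1, 2, 3, 4, 5]
metric_a = [0.10, 0.35, 0.30, 0.70, 0.90]  # hypothetical metric
metric_b = [0.50, 0.40, 0.60, 0.45, 0.55]  # hypothetical metric

# Simple stand-in ensemble: average the two metrics' ranks per output.
ensemble = [
    (a + b) / 2
    for a, b in zip(average_ranks(metric_a), average_ranks(metric_b))
]

for name, scores in [("metric_a", metric_a),
                     ("metric_b", metric_b),
                     ("ensemble", ensemble)]:
    print(f"{name}: Spearman vs. human = {spearman(human, scores):.3f}")
```

A metric whose scores rank the outputs in nearly the same order as the human ratings gets a correlation close to 1, which is the sense in which a metric is judged reliable in this kind of meta-evaluation.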

