CL | Feb 10, 2025

SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

arXiv:2502.06394v1 | 11 citations | h-index: 5 | Has Code | NAACL
Originality: Highly original
AI Analysis

This work addresses the scarcity of parallel multilingual datasets for text detoxification, a bottleneck for researchers and developers working on multilingual text processing.

The authors introduce SynthDetoxM, a dataset of 16,000 multilingual parallel text detoxification sentence pairs. Models trained on SynthDetoxM outperform models trained on the human-annotated MultiParaDetox dataset, and also outperform all evaluated LLMs in the few-shot setting.

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in the few-shot setting. We release our dataset and code to help further research in multilingual text detoxification.
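The few-shot rewriting step mentioned in the abstract can be illustrated with a minimal sketch of how such a prompt might be assembled. Everything here is an assumption made for illustration: the `DetoxExample` type, the `build_few_shot_prompt` helper, the instruction wording, and the demonstration pairs are hypothetical and are not taken from the paper or its released code. English is used for readability, although the dataset covers German, French, Spanish and Russian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetoxExample:
    """One toxic sentence paired with a neutral rewrite (hypothetical type)."""
    toxic: str
    neutral: str


def build_few_shot_prompt(examples, toxic_text, language):
    """Assemble a few-shot detoxification prompt as a single string."""
    # Hypothetical instruction wording; not the prompt used by the authors.
    lines = [
        f"Rewrite the following {language} text so that it is not toxic, "
        "keeping the original meaning.",
        "",
    ]
    # Each demonstration shows the model one toxic/neutral pair.
    for ex in examples:
        lines.append(f"Toxic: {ex.toxic}")
        lines.append(f"Neutral: {ex.neutral}")
        lines.append("")
    # The final slot is left open for the model to complete.
    lines.append(f"Toxic: {toxic_text}")
    lines.append("Neutral:")
    return "\n".join(lines)


# Hypothetical demonstration pairs, invented for illustration only.
shots = [
    DetoxExample("This is a stupid idea.", "This idea is not good."),
    DetoxExample("Shut up, nobody cares.", "Please stop, others are not interested."),
]
prompt = build_few_shot_prompt(shots, "What a dumb question.", "English")
print(prompt)
```

The resulting string would be sent to an LLM of choice, and the completion after the final "Neutral:" would be taken as the candidate detoxified sentence. Any filtering or quality scoring of the candidates is outside this sketch.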

