SD, CL, AS · Jan 10, 2024

MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector

arXiv:2401.05060v2 · 39 citations · h-index: 35 · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the limited support for toxicity detection in non-English audio, a gap that affects both researchers and practitioners, and represents a strong domain-specific advance.

The authors address the lack of multilingual audio-based toxicity detection by introducing MuTox, a dataset and zero-shot classifier. The classifier outperforms existing text-based trainable classifiers by more than 1% AUC and improves precision and recall by approximately 2.5 times over a wordlist-based classifier.

Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 19 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier outperforms existing text-based trainable classifiers by more than 1% AUC, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves precision and recall by approximately 2.5 times. This significant improvement underscores the potential of MuTox in advancing the field of audio-based toxicity detection.
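The abstract compares systems on AUC and on precision and recall. The sketch below shows how those metrics are typically computed for a score-based classifier and for a wordlist-based one. It is an illustration only, not the MuTox evaluation code: the function names, the placeholder wordlist, and every label, score, and transcript are made-up toy values, not data or results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random toxic item outscores a random
    non-toxic one (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(labels, predictions):
    """Precision and recall for binary predictions (1 = toxic)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def wordlist_predict(transcripts, wordlist):
    """Hypothetical wordlist baseline: flag an utterance if any token is listed."""
    return [
        int(any(token in wordlist for token in text.lower().split()))
        for text in transcripts
    ]


# Toy data, invented for illustration (1 = toxic, 0 = non-toxic).
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]  # hypothetical classifier outputs
transcripts = [
    "you are a badword",
    "have a nice day",
    "such a rudeword thing",
    "badword is only a placeholder here",   # listed term, benign use
    "a subtle insult without listed terms",  # toxic, but no listed term
    "see you tomorrow",
]
wordlist = {"badword", "rudeword"}  # placeholder terms, not a real list

auc = roc_auc(labels, scores)
predictions = wordlist_predict(transcripts, wordlist)
precision, recall = precision_recall(labels, predictions)
print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The toy transcripts include one benign use of a listed term and one toxic utterance with no listed term, the two failure modes that cost a wordlist approach precision and recall respectively.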

Code Implementations: 1 repo
