SD | AI | LG | Sep 30, 2025

Representation-Based Data Quality Audits for Audio

arXiv:2509.26291v1 | h-index: 11
Originality: Synthesis-oriented
AI Analysis

This paper addresses data quality problems in audio-based ML systems, but it is incremental: it adapts an existing framework to a new domain.

The paper tackles data quality issues like off-topic samples and label errors in audio systems by adapting the SelfClean framework from images to audio, using self-supervised representations to rank issues. It achieves state-of-the-art ranking performance on benchmarks like ESC-50 and GTZAN, enabling significant annotation savings.

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This paper addresses these issues by adapting SelfClean, a representation-to-rank data auditing framework, from the image to the audio domain. The approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.
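To make the representation-to-rank idea concrete, here is a minimal sketch of how ranked review lists for the three issue types could be derived from precomputed embeddings. It is not the paper's or SelfClean's implementation: the function name `audit_rankings` and the three scoring heuristics (pairwise cosine distance for near duplicates, nearest-neighbour distance for off-topic samples, and a same-label versus different-label distance ratio for label errors) are simplifying assumptions chosen for illustration, and the embeddings are assumed to come from some self-supervised audio encoder that is not shown.

```python
import numpy as np


def audit_rankings(embeddings, labels):
    """Return ranked review lists from embeddings (illustrative heuristics only).

    Assumes every label occurs at least twice and at least two labels exist;
    otherwise the label-error score is undefined (NaN) for some samples.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Cosine distance between all pairs of L2-normalised embeddings.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbour

    # Near duplicates: all unordered pairs, closest pairs first.
    i, j = np.triu_indices(len(X), k=1)
    order = np.argsort(dist[i, j])
    near_duplicate_pairs = list(zip(i[order].tolist(), j[order].tolist()))

    # Off-topic candidates: samples far from their nearest neighbour first.
    nn_dist = dist.min(axis=1)
    off_topic = np.argsort(-nn_dist).tolist()

    # Label-error candidates: nearest same-label sample is far away while
    # the nearest different-label sample is close (score near 1).
    same = y[:, None] == y[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_diff = np.where(~same, dist, np.inf).min(axis=1)
    label_score = d_same / (d_same + d_diff)
    label_errors = np.argsort(-label_score).tolist()

    return near_duplicate_pairs, off_topic, label_errors
```

A reviewer would then inspect each list from the top and stop once the hit rate drops, which is where the annotation savings described above would come from. The all-pairs distance matrix is quadratic in the number of samples, so this sketch only suits small datasets.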

