SD LG AS
Sep 19, 2024

A quest through interconnected datasets: lessons from highly-cited ICASSP papers

arXiv:2410.03676v1
h-index: 8
Originality: Synthesis-oriented
AI Analysis

This work highlights a critical gap in accountability for data provenance in applied machine learning, particularly as models grow larger. It is incremental, however: it builds on existing concerns without introducing new methods.

The study analyzed dataset usage in top-cited ICASSP papers to assess data quality and origins. It found that data provenance was often unclear or entangled, despite its importance for societally impactful audio machine learning applications.

As audio machine learning outcomes are deployed in societally impactful applications, it is important to have a sense of the quality and origins of the data used. Noticing that being explicit about this is not readily rewarded in academic publishing in applied machine learning domains, nor included in typical applied machine learning curricula, we present a study of dataset usage connected to the top-5 cited papers at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). In it, we conduct thorough depth-first analyses of the origins of the datasets used, which often required searches beyond what was reported in official papers and ended in unclear or entangled origins. Especially given the current pull towards larger, and possibly generative, AI models, awareness of the need for accountability on data provenance is increasing. We therefore call on the community not only to focus on engineering larger models, but also to create more room and reward for making explicit the foundations on which such models should be built.

