CL | SD | AS | Jun 13, 2024

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

arXiv:2406.09282v1 | 13 citations
Originality: Incremental advance
AI Analysis

This work addresses data quality issues for speech-to-text models, but it is incremental as it builds on prior OWSM versions.

The study tackled the problem of data heterogeneity in speech-to-text foundation models by introducing OWSM v3.2, which improved performance over the baseline while using 15% less training data through strategies like data filtering and punctuation incorporation.

The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with a proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations kept the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
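
The abstract does not say which proxy task or threshold is used, so the following is only a minimal sketch of the general idea of filtering by a proxy score. It assumes each utterance already carries an error score from some small proxy model. The `Utterance` class, the `proxy_error` field, the `max_error` threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    dataset: str
    text: str
    # Hypothetical score: error of a small proxy model on this utterance.
    proxy_error: float


def filter_by_proxy_score(utterances, max_error):
    """Keep utterances whose proxy error is at or below max_error."""
    return [u for u in utterances if u.proxy_error <= max_error]


def fraction_removed(original, kept):
    """Share of the original utterances dropped by filtering."""
    return 1.0 - len(kept) / len(original)


# Toy data, invented purely for illustration.
corpus = [
    Utterance("set_a", "hello world", 0.05),
    Utterance("set_a", "good morning", 0.10),
    Utterance("set_b", "mislabeled transcript", 0.90),
    Utterance("set_b", "thank you", 0.20),
]

kept = filter_by_proxy_score(corpus, max_error=0.5)
print(len(kept), fraction_removed(corpus, kept))
```

On this toy corpus, one of the four utterances exceeds the threshold and is dropped. A real pipeline would likely measure the reduction in hours of audio rather than utterance counts.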
