CL · SD · AS · Sep 25, 2023

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

CMU · Meta AI · NVIDIA
arXiv:2309.13876v3 · 74 citations · h-index: 83 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work gives researchers an accessible pipeline for studying and improving speech models, including training-related issues such as efficiency and bias. Its contribution is incremental, since it replicates an existing approach.

The authors reproduce the training of OpenAI Whisper's speech model by developing the Open Whisper-style Speech Model (OWSM), an open-source counterpart built with publicly available data. OWSM supports more translation directions than Whisper and can be more efficient to train.

Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.

Code Implementations: 7 repos

Data from Papers with Code (CC-BY-SA-4.0)
