SD, LG, AS | May 15, 2023

A Whisper transformer for audio captioning trained with synthetic captions and transfer learning

arXiv:2305.09690v1 | 17 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses audio captioning for applications like accessibility or multimedia indexing, but it appears incremental as it builds on existing Whisper models and synthetic data techniques.

The researchers tackled audio captioning by using a pretrained Whisper speech-to-text model with synthetic captions and transfer learning, finding that different training strategies significantly impact performance.

The field of audio captioning has advanced significantly in recent years, driven by the availability of large-scale audio datasets and progress in deep learning techniques. In this technical report, we present our approach to audio captioning, focusing on the use of a pretrained speech-to-text Whisper model and pretraining on synthetic captions. We discuss our training procedures and present the results of our experiments, which cover model size variations, dataset mixtures, and other hyperparameters. Our findings demonstrate the impact of different training strategies on the performance of the audio captioning model. Our code and trained models are publicly available on GitHub and Hugging Face Hub.

Code Implementations: 1 repo
