AS CL SD

Dec 14, 2020

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

arXiv:2012.07331v1 14.547 citations

Originality: Incremental advance

AI Analysis

This work provides an incremental solution for researchers and practitioners in audio captioning facing limited training data.

This paper addresses data scarcity in audio captioning by leveraging a pre-trained large-scale language model. The authors guide the language model with similar captions retrieved from the training dataset. They report that this approach makes a pre-trained model usable for audio captioning and that its oracle performance exceeds that of conventional methods trained from scratch.

The goal of audio captioning is to translate input audio into a natural-language description. One problem in audio captioning is the lack of training data, since audio-caption pairs are difficult to collect by crawling the web. In this study, to overcome this problem, we propose using a pre-trained large-scale language model. Because audio cannot be fed directly into such a language model, we use guidance captions retrieved from a training dataset based on similarities between different audio clips. The caption for the input audio is then generated by the pre-trained language model while referring to the guidance captions. Experimental results show that (i) the proposed method succeeds in using a pre-trained language model for audio captioning, and (ii) the oracle performance of the pre-trained model-based caption generator is clearly better than that of the conventional method trained from scratch.
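
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each audio clip has already been mapped to a fixed-length embedding vector and that similarity is measured by cosine similarity, neither of which is specified in the abstract. The function names, the toy embeddings, and the choice of `k` are hypothetical, and the generation step with the pre-trained language model is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_guidance_captions(query_embedding, train_embeddings, train_captions, k=2):
    """Return the captions of the k training clips most similar to the query audio.

    Assumption: embeddings come from some audio encoder that is not shown here.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), caption)
        for emb, caption in zip(train_embeddings, train_captions)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep dataset order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


# Toy data standing in for a training set of (audio embedding, caption) pairs.
train_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
train_captions = [
    "A dog barks in the distance",
    "A dog barks while cars pass",
    "Rain falls on a roof",
    "People talk in a cafe",
]

query_embedding = [1.0, 0.05, 0.0]  # hypothetical embedding of the input audio
guidance = retrieve_guidance_captions(query_embedding, train_embeddings, train_captions, k=2)
print(guidance)
```

In a full system, the retrieved captions would be passed to the pre-trained language model as guidance while it generates the caption for the input audio.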