SD MM AS
Oct 12, 2021

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

arXiv:2110.06100v1 14.231 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating meaningful descriptions for audio clips, which is important for applications in accessibility and multimedia analysis, though its contribution is incremental.

The paper tackles the problem of automated audio captioning by integrating both acoustic and semantic information, achieving state-of-the-art performance on the Clotho dataset.

Automated audio captioning (AAC) has developed rapidly in recent years. It combines acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. Current models are generally based on the neural encoder-decoder architecture, and their decoders mainly use acoustic information extracted by a CNN-based encoder. However, they ignore semantic information that could help an AAC model generate meaningful descriptions. This paper proposes a novel approach to automated audio captioning that incorporates both semantic and acoustic information. Specifically, our audio captioning model consists of two sub-modules. (1) The pre-trained keyword encoder initializes its parameters from a pre-trained ResNet38 and is then trained with extracted keywords as labels. (2) The multi-modal attention decoder is an LSTM-based decoder that contains semantic and acoustic attention modules. Experiments demonstrate that our proposed model achieves state-of-the-art performance on the Clotho dataset. Our code can be found at https://github.com/WangHelin1997/DCASE2021_Task6_PKU
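
To make the idea of a decoder with separate semantic and acoustic attention modules concrete, the following is a minimal sketch of a single decoding step. It is not the authors' implementation: the abstract does not specify the attention form, so scaled dot-product attention is assumed here, and the function names (`attend`, `decoder_step_input`), the vector shapes, and the choice to concatenate the two context vectors with the previous word embedding are all hypothetical. In a full model, the resulting vector would be fed to an LSTM cell.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Scaled dot-product attention of one query vector over a list of vectors.

    Assumption: query and every item share the same dimensionality.
    Returns the weighted-average context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, item)) / scale for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return context, weights


def decoder_step_input(hidden, acoustic_frames, keyword_embeddings, prev_word_embedding):
    """Build the input to one (hypothetical) LSTM decoder step.

    The decoder state attends separately over acoustic frames (acoustic
    attention) and over keyword embeddings (semantic attention). The two
    context vectors are concatenated with the previous word embedding.
    """
    acoustic_ctx, acoustic_w = attend(hidden, acoustic_frames)
    semantic_ctx, semantic_w = attend(hidden, keyword_embeddings)
    step_input = prev_word_embedding + acoustic_ctx + semantic_ctx  # list concatenation
    return step_input, acoustic_w, semantic_w


# Toy example with 2-dimensional vectors (illustrative values only).
hidden = [1.0, 0.0]
acoustic_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keyword_embeddings = [[0.5, 0.5], [2.0, 0.0]]
prev_word_embedding = [0.1, 0.2]

step_input, acoustic_w, semantic_w = decoder_step_input(
    hidden, acoustic_frames, keyword_embeddings, prev_word_embedding
)
```

Keeping the two attention modules separate lets the decoder weight audio frames and predicted keywords independently at each step, which is one plausible reading of how the semantic and acoustic information is integrated.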
