SD, LG, AS · Jan 28, 2022

Automatic Audio Captioning using Attention weighted Event based Embeddings

arXiv:2201.12352v1
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating natural language descriptions from audio for applications like accessibility or multimedia indexing, but it is incremental as it builds on existing transfer learning trends in the field.

The paper tackles automatic audio captioning by proposing an encoder-decoder architecture with lightweight Bi-LSTM layers and pre-trained audio event detection models as embedding extractors, achieving results that surpass those reported in the existing literature for computationally intensive architectures.

Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of the events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass results in the existing literature obtained with computationally intensive architectures. Further, we provide evidence that the non-uniform attention-weighted encoding generated as part of our model helps the decoder glance over specific sections of the audio while generating each token.
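
The attention-weighted encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes dot-product scoring between each encoder frame and the current decoder state; the abstract does not say which scoring function the authors use, so this is an illustrative choice, not the paper's method. The names `temporal_attention`, `frames`, and `state`, and the toy values, are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames: List[Vector], decoder_state: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights over time, attention-weighted context vector).

    Assumption: dot-product scoring. The source does not specify the
    scoring function, so this stands in for whatever the model learns.
    """
    if not frames:
        raise ValueError("need at least one frame")
    # One score per time frame: similarity between the frame and the decoder state.
    scores = [sum(f * d for f, d in zip(frame, decoder_state)) for frame in frames]
    # Non-uniform weights over time that sum to 1.
    weights = softmax(scores)
    dim = len(frames[0])
    # Context vector: weighted sum of the frame embeddings.
    context = [sum(w * frame[j] for w, frame in zip(weights, frames)) for j in range(dim)]
    return weights, context


# Toy example: three time frames of 2-dimensional embeddings (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [0.0, 2.0]
weights, context = temporal_attention(frames, state)
```

In a captioning decoder, this step would be repeated for every generated token with an updated decoder state, so the weights can shift to different sections of the audio from one token to the next.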

