CL, SD, AS · Aug 10, 2021

Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization

arXiv:2108.04692v1 · 22 citations
Originality: Highly original
AI Analysis

This work addresses the generation of descriptive captions for audio content, a problem relevant to accessibility and multimedia analysis. It represents an incremental advance in the field.

The paper tackles automated audio captioning by combining transfer learning from pretrained audio neural networks with a novel self-supervised regularization method, achieving state-of-the-art results on the Clotho dataset with significant improvements across multiple metrics.

In this paper, we examine the use of transfer learning with Pretrained Audio Neural Networks (PANNs) and propose an architecture that better leverages the acoustic features provided by PANNs for the Automated Audio Captioning task. We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR). The RLSSR module supplements the training of the model by minimizing the similarity between the encoder and decoder embedding. The combination of both methods allows us to surpass state-of-the-art results by a significant margin on the Clotho dataset across several metrics and benchmarks.
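
To make the idea of a latent-space similarity regularizer concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The abstract does not state the similarity measure, the sign convention, or the weighting, so this sketch assumes cosine similarity, a penalty of 1 minus the mean similarity (which pulls the two embeddings together), and a hypothetical weight of 0.1. The function names are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_regularizer(
    enc_batch: Sequence[Sequence[float]],
    dec_batch: Sequence[Sequence[float]],
) -> float:
    """Assumed penalty: 1 - mean cosine similarity over paired embeddings."""
    sims = [cosine_similarity(e, d) for e, d in zip(enc_batch, dec_batch)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(
    caption_loss: float,
    enc_batch: Sequence[Sequence[float]],
    dec_batch: Sequence[Sequence[float]],
    weight: float = 0.1,  # hypothetical value, not taken from the paper
) -> float:
    """Captioning loss plus the weighted regularization term."""
    return caption_loss + weight * similarity_regularizer(enc_batch, dec_batch)
```

Under these assumptions, identical encoder and decoder embeddings add no penalty, while orthogonal ones add the full weight. A different similarity measure or sign convention would only change `similarity_regularizer`.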

