SD · LG · MM · AS · Oct 21, 2020

WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

arXiv:2010.11098v1 · 19 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating textual descriptions from audio for applications like accessibility and multimedia indexing, representing an incremental but measurable advance in the field.

The paper tackles automated audio captioning by proposing WaveTransformer, a novel architecture that explicitly learns temporal and time-frequency patterns from audio. The method achieves a SPIDEr score of 17.3 on the Clotho dataset, improving over the previous best of 16.2.

Automated audio captioning (AAC) is a novel task in which a method takes an audio sample as input and outputs a textual description (i.e., a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method that explicitly focuses on exploiting the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting the local and temporal information, and one for merging the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method using the freely available splits of the Clotho dataset. Our results increase the previously reported highest SPIDEr from 16.2 to 17.3.
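To put the reported SPIDEr figures in context, the short sketch below computes the absolute and relative gain implied by the two scores quoted above. It also includes a helper reflecting the usual definition of SPIDEr as the arithmetic mean of the CIDEr and SPICE scores. The function names `spider` and `improvement` are illustrative and do not come from the paper or its repository, and the abstract does not report the component CIDEr and SPICE values, so none are filled in here.

```python
from typing import Tuple


def spider(cider: float, spice: float) -> float:
    """SPIDEr as commonly defined: the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


def improvement(previous: float, current: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = current - previous
    relative = 100.0 * absolute / previous
    return absolute, relative


# SPIDEr scores quoted in the abstract (shown on a 0-100 scale).
abs_gain, rel_gain = improvement(previous=16.2, current=17.3)
print(f"absolute gain: {abs_gain:.1f} points")
print(f"relative gain: {rel_gain:.1f}%")
```

Under these numbers the gain is about 1.1 points, or roughly 6.8% relative to the previous best.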

Code Implementations: 1 repo