SD, CL, AS | May 31, 2019

Audio Caption in a Car Setting with a Sentence-Level Loss

arXiv:1905.13448v2 | 3 citations
Originality: Incremental advance
AI Analysis

This work addresses Mandarin audio captioning in specific scenes such as cars and hospitals. It is incremental: it builds on existing methods and adds a new loss and a new dataset.

The paper tackles audio captioning in a car setting by pairing a sentence-level loss with a GRU encoder-decoder model. This improves classical NLG metrics, sentence richness, and human evaluation ratings on the Car, Hospital, and Joint datasets, though human annotations still outperform the model's captions.

Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed to be used in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly-proposed Car dataset, a previously published Mandarin Hospital dataset and the Joint dataset, indicating its generalization capability across different scenes. An improvement in all metrics can be observed, including classical natural language generation (NLG) metrics, sentence richness and human evaluation ratings. However, though detailed audio captions can now be automatically generated, human annotations still outperform model captions on many aspects.
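The abstract does not spell out how the sentence-level loss is computed, so the following is only a minimal sketch of one plausible reading: a token-level cross-entropy term plus a cosine distance between mean-pooled sentence embeddings of the generated and reference captions. The function names, the mean pooling, and the `weight` parameter are all assumptions made for illustration, not the paper's formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(token_vectors):
    """Average a list of token vectors into one sentence vector (assumed pooling)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def token_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference tokens under the decoder."""
    total = sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))
    return -total / len(target_ids)


def combined_loss(step_probs, target_ids, predicted_vectors, reference_vectors,
                  weight=1.0):
    """Word-level loss plus a weighted sentence-level term (hypothetical combination)."""
    word_loss = token_cross_entropy(step_probs, target_ids)
    similarity = cosine_similarity(mean_pool(predicted_vectors),
                                   mean_pool(reference_vectors))
    sentence_loss = 1.0 - similarity  # 0 when the sentence embeddings align
    return word_loss + weight * sentence_loss
```

Under these assumptions, the sentence-level term vanishes when the pooled embedding of the generated caption points in the same direction as that of the reference, so it rewards semantic agreement even when individual tokens differ.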

Code Implementations: 1 repo
