AS CL LG SD | Aug 23, 2023

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

arXiv:2308.11923v14.312 citations | h-index: 28 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific issue in audio captioning that matters for applications requiring fine-grained audio analysis, but it is incremental: it extends an existing task with new methods and data.

The paper tackles the problem that conventional audio captioning generates similar captions for similar audio clips. It proposes Audio Difference Captioning (ADC), which describes the semantic differences between pairs of similar audio clips, and shows that the proposed methods solve this task effectively on a new AudioDiffCaps dataset.

We propose Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. ADC addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips, and a similarity-discrepancy disentanglement to emphasize the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, consisting of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively, and visualization of the attention weights in the transformer encoder showed that the methods improve those weights for extracting the difference.
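The abstract does not spell out how the encoder concentrates on cross-attention. One plausible reading, offered here only as an illustrative sketch, is an attention mask over the concatenated pair of clips that lets each position attend only to positions in the other clip. The function names `cross_only_mask` and `masked_attention_weights` are hypothetical and do not come from the paper or its code.

```python
import math


def cross_only_mask(len_a, len_b):
    """Boolean mask over the concatenated sequence [clip A; clip B].

    mask[i][j] is True when query position i may attend to key position j.
    Assumption: only positions in different clips may attend to each other.
    """
    n = len_a + len_b
    in_a = [i < len_a for i in range(n)]
    return [[in_a[i] != in_a[j] for j in range(n)] for i in range(n)]


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out entries get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: clip A has 2 frames, clip B has 3 frames, all scores equal.
mask = cross_only_mask(2, 3)
scores = [[0.0] * 5 for _ in range(5)]
weights = masked_attention_weights(scores, mask)
```

In this toy setup, each frame of clip A spreads its attention evenly over the three frames of clip B and gives no weight to clip A itself, so every attended value comes from the other clip. Both clips must be non-empty, otherwise a row would have no allowed positions.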

